Abstract

The recently proposed Minimal Complexity Machine (MCM) finds a hyperplane
classifier by minimizing an exact bound on the Vapnik-Chervonenkis (VC)
dimension. The VC dimension measures the capacity of a learning machine, and
a smaller VC dimension leads to improved generalization. On many benchmark
datasets, the MCM generalizes better than SVMs and uses far fewer support
vectors. In this paper, we describe a neural network, based on a linear
dynamical system, that converges to the MCM solution. The proposed MCM
dynamical system is conducive to an analogue circuit implementation on a chip
or to simulation using Ordinary Differential Equation (ODE) solvers. Numerical
experiments on benchmark datasets from the UCI repository show that the
proposed approach is scalable and accurate: the MCM dynamical system obtains
improved accuracies with fewer support vectors (up to a 74.3% reduction).


Keywords. Linear Programming, Neural Network, VC Dimension, Minimal 
Complexity Machine, Neurodynamical Systems 

1. Introduction 

Support vector machines (SVMs) have evolved to become one of the most
widely used machine learning techniques today. They have been employed in a
number of applications to obtain cutting-edge performance, and novel uses
have also been devised in which their utility has been amply demonstrated.
The classical SVM [9] and the least squares SVM (LSSVM) [32] have spawned
a multitude of formulations. Most SVM formulations require the solution of a
Quadratic Programming Problem (QPP), involving an objective function that
maximizes the margin (with a term for the admissible error in the case of the
soft-margin SVM) subject to suitable constraints. The solution to such an
optimization problem is obtained in terms of a separating hyperplane, the
determination of which is a direct consequence of the number of support
vectors identified in the dataset. Practical machine learning problems of
today involve large datasets, and efficient real-time performance of learning
systems demands learning algorithms that minimize the learning complexity in
terms of space, time or both.

The complexity of learning systems, such as SVMs, can be estimated by the
Vapnik-Chervonenkis (VC) dimension. A smaller value of the VC dimension
indicates robust generalization and lower test set error rates; hence a large VC
dimension is undesirable. As stated in pioneering work by Vapnik [34],
Burges [3] and others, SVMs can have a large, possibly infinite, VC dimension.
This implies that SVMs may work well in practice, but there is no guarantee
that they will generalize well. In fact, Vapnik and Chervonenkis [35] arrive at
a bound on the stochastic approximation of the empirical risk, as given by
Equations (1)-(2), which holds with probability $(1 - \eta)$:

$$R(\lambda) \le R_{emp}(\lambda) + \sqrt{\frac{h\left(\ln\frac{2l}{h} + 1\right) - \ln\frac{\eta}{4}}{l}} \tag{1}$$

where

$$R_{emp}(\lambda) = \frac{1}{l} \sum_{i=1}^{l} |f_\lambda(x_i) - y_i|, \tag{2}$$

and $f_\lambda$ is a function having VC-dimension $h$ with the smallest empirical risk on
a dataset $\{x_i, i = 1, 2, \ldots, l\}$ of $l$ data points with corresponding labels
$\{y_i, i = 1, 2, \ldots, l\}$.
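To make the role of $h$ in this bound concrete, the following minimal sketch evaluates the confidence term of Equation (1), assuming the standard form $\sqrt{(h(\ln(2l/h) + 1) - \ln(\eta/4))/l}$. The function name `vc_confidence` and the sample values are illustrative assumptions and are not taken from the paper.

```python
import math


def vc_confidence(h, l, eta):
    """Confidence term of the VC bound (assumed standard form of Eq. (1)).

    h   : VC dimension of the function class
    l   : number of training samples
    eta : the bound holds with probability 1 - eta
    """
    if not (h > 0 and l > 0 and 0 < eta < 1):
        raise ValueError("need h > 0, l > 0 and 0 < eta < 1")
    return math.sqrt((h * (math.log(2.0 * l / h) + 1.0) - math.log(eta / 4.0)) / l)


# Illustrative values only: for a fixed sample size, a smaller VC
# dimension gives a tighter bound on the gap between true and empirical risk.
small = vc_confidence(h=10, l=1000, eta=0.05)
large = vc_confidence(h=100, l=1000, eta=0.05)
print(f"h=10: {small:.3f}, h=100: {large:.3f}")
```

For a fixed number of samples the term grows with $h$, which is the motivation for seeking a classifier of small VC dimension.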

Recently, it has been shown that a formulation termed the Minimal
Complexity Machine (MCM) [19] can be used to realize a large-margin classifier while
minimizing an exact ($\Theta$) bound on the VC dimension. The approach requires
the solution of a linear programming problem, and generalizes well on
benchmark datasets. The MCM outperforms SVMs in terms of test set accuracy,
while using far fewer support vectors; in many instances, the MCM predicts
better while using less than 10% of the number of support vectors used by SVMs
[19, Table III]. Variants of the MCM have been proposed for regression [22],
fuzzy classification [?] and feature selection for large datasets [20].

Our focus in this paper is a neurodynamical system that converges to the
MCM solution, thus yielding a minimal VC dimension classifier. A dynamical
system that converges to a minimum VC dimension classifier allows for
high-speed, real-time implementation, e.g. as an analogue VLSI chip. Since this
approach yields a system that has low complexity, it opens up a wide range of
applications in the learning and modelling domains. The MCM solutions are usually
very sparse; this provides the advantage of lower computational cost in a
hardware implementation, which carries over to VLSI implementations and is
therefore of much interest.

Applications based on dynamical systems have attracted significant attention
over the last three decades, owing to the potential for real-time, high-speed
realizations as electronic circuits [?] or as recurrent neural networks [?].
The behaviour of such neurodynamical systems is also of interest, as they have
been used in modeling biological systems [?], solving optimization problems
[32], [33], [?], large-scale problems, fuzzy symbolic dynamics [?] and working
memory [29], among others.

There have also been several works that integrate linear or quadratic
programming within neurodynamical systems. For instance, Bennett
and Mangasarian [26] proposed a technique for training neural networks using
linear programming based on the Multi-surface Method, which was applied to
breast cancer diagnosis. Faybusovich [?], [13] proposed dynamical systems
for solving linear programs based on barrier functions and presented their
Hamiltonian analysis. Maa and Shanblatt [25] presented a neural network
formulation for linear and quadratic programming, extending the network originally
proposed by Kennedy and Chua [23]. Jun Wang presented a recurrent neural
network for solving Linear Programming Problems (LPPs) [36] in 1993, which
was followed by a neural network for solving LPPs with bounded variables by
Xia and Wang [?] in 1995. In 1996, Wu et al. presented a neural network with
global convergence guarantees [37], [?]. Other work in this direction includes the
approaches presented by Oskoei and Amiri in 2006 [16] and by Chukwunenye
in 2014 [7]. An overview of dynamical system methods for mathematical
programming from a control perspective can be found in Bhaya and Kaszkurewicz
[?].

Recent work on the application of dynamical systems involves solving LPs for
estimation in the context of image restoration by Xia et al. [39] and solving the
assignment problem [18]. Liu et al. [23] demonstrate the use of a neural network
to solve a non-smooth optimization problem with linear constraints, while
Perez-Ilzarbe [30] shows its use for solving a quadratic problem with linear constraints.
In contrast, the MCM formulation allows us to find a minimal VC dimension
classifier using a neurodynamical system that finds the optimal solution of an
LPP, with guaranteed convergence and provably good generalization.

The rest of the paper is organized as follows. Section 2 introduces the
Minimal Complexity Machine (MCM) and the associated optimization problem.
Section 3 describes the MCM neurodynamical system, and an analysis of its
convergence on synthetic datasets is shown in Section 4. Section 5 discusses
simulation results. Section 6 contains concluding remarks.

2. Motivating the Minimal Complexity Machine 

Consider a binary classification problem with data points $x^i$, $i = 1, 2, \ldots, M$,
where samples of class +1 and -1 are associated with labels
$y_i = 1$ and $y_i = -1$, respectively. We assume that the dimension of the input
samples is $n$, i.e. $x^i = (x^i_1, x^i_2, \ldots, x^i_n)^T$. The problem of interest is finding a
hyperplane of the form

$$u^T x + v = 0 \tag{3}$$

that has the smallest Vapnik-Chervonenkis dimension $\gamma$, and that separates the
samples with least error. In [19], it has been shown that there exist constants
$\alpha, \beta > 0$, $\alpha, \beta \in \mathbb{R}$ such that

$$\alpha h^2 \le \gamma \le \beta h^2 \tag{4}$$

where

$$h = \frac{\max_{i=1,2,\ldots,M} \|u^T x^i + v\|}{\min_{i=1,2,\ldots,M} \|u^T x^i + v\|}. \tag{5}$$

In other words, $h^2$ constitutes a tight or exact ($\Theta$) bound on the VC dimension
$\gamma$. An exact bound implies that $h^2$ and $\gamma$ are close to each other. Thus,
the machine capacity can be minimized by minimizing $h^2$ or, equivalently, $h$.
The MCM optimization problem attempts to find a classifier with the smallest
machine capacity that makes as few misclassification errors on the training
data as possible. This leads to a fractional programming problem which, after
suitable transformations, leads to the following optimization problem [19]. This
transformation is discussed in detail in [19, App. A].
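The quantity $h$ in Equation (5) is straightforward to evaluate for a given hyperplane. The sketch below does so for a toy dataset; the helper name `capacity_ratio` and the data are assumptions made purely for illustration, and since $u^T x + v$ is a scalar the norm reduces to an absolute value.

```python
def capacity_ratio(points, u, v):
    """h from Eq. (5): max |u.x + v| divided by min |u.x + v| over the data."""
    scores = [abs(sum(ui * xi for ui, xi in zip(u, x)) + v) for x in points]
    if min(scores) == 0.0:
        raise ValueError("a point lies on the hyperplane; h is unbounded")
    return max(scores) / min(scores)


# Toy 2-D data (made up for illustration) and two candidate hyperplanes
# that both separate the two positive points from the two negative ones.
points = [(1.0, 0.0), (2.0, 0.0), (-1.0, 0.0), (-3.0, 0.0)]
h_a = capacity_ratio(points, u=(1.0, 0.0), v=0.0)   # |scores| = 1, 2, 1, 3
h_b = capacity_ratio(points, u=(1.0, 0.0), v=0.5)   # |scores| = 1.5, 2.5, 0.5, 2.5
print(h_a, h_b)
```

Both hyperplanes separate the toy data, but the first has the smaller $h$ and would therefore be preferred by a capacity-minimizing criterion.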


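$$\min_{w, b, h, q} \; h + C \cdot \sum_{i=1}^{M} q_i \tag{6}$$

$$h \ge y_i \cdot [w^T x^i + b] + q_i, \quad i = 1, 2, \ldots, M \tag{7}$$

$$y_i \cdot [w^T x^i + b] + q_i \ge 1, \quad i = 1, 2, \ldots, M \tag{8}$$

$$q_i \ge 0, \quad i = 1, 2, \ldots, M. \tag{9}$$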


Here, the choice of $C$ allows a tradeoff between the complexity (machine
capacity) of the classifier and the classification error. The soft margin MCM is
described by the formulation in Equations (6)-(9).
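For a fixed candidate $(w, b)$, constraints (7)-(9) determine the smallest feasible slacks and the smallest feasible $h$ in closed form: $q_i = \max(0, 1 - y_i f(x^i))$ and $h = \max_i (y_i f(x^i) + q_i)$. The following sketch uses this observation to evaluate objective (6); it illustrates the tradeoff controlled by $C$ and is not a solver. The function name and the data are hypothetical.

```python
def mcm_objective(X, y, w, b, C):
    """Evaluate objective (6) at a fixed (w, b).

    For fixed (w, b) the smallest slacks allowed by (8)-(9) are
    q_i = max(0, 1 - y_i * f(x_i)); constraint (7) then forces
    h = max_i (y_i * f(x_i) + q_i).
    """
    margins = [yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
               for xi, yi in zip(X, y)]
    q = [max(0.0, 1.0 - m) for m in margins]
    h = max(m + qi for m, qi in zip(margins, q))
    return h + C * sum(q), h, q


# Made-up 1-D samples: the second one lies inside the margin.
X = [(2.0,), (0.5,), (-1.0,), (-3.0,)]
y = [1, 1, -1, -1]
objective, h, q = mcm_objective(X, y, w=(1.0,), b=0.0, C=1.0)
print(objective, h, q)
```

Only the sample inside the margin receives a nonzero slack, so a larger $C$ penalizes that sample more heavily relative to the capacity term $h$.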

Once $w$ and $b$ have been determined by solving Equations (6)-(9), the class
of a test sample $x$ may be determined by using the sign of $f(x)$ in
Equation (10).

$$f(x) = w^T x + b \tag{10}$$

In Section 3, we show how the MCM solution can be determined by a dynamical
system.

On similar lines, the kernel MCM obtains a hyperplane in $\phi$ space given by

$$f(x) = w^T \phi(x) + b \tag{11}$$

where $\phi(\cdot)$ maps input vectors into a higher dimensional image space. The kernel
MCM solves the following optimization problem.

$$\min_{w, b, h, q} \; h + C \cdot \sum_{i=1}^{M} q_i \tag{12}$$

$$h \ge y_i \cdot \Big[ \sum_{j=1}^{M} \lambda_j K(x^i, x^j) + b \Big] + q_i, \quad i = 1, 2, \ldots, M \tag{13}$$

$$y_i \cdot \Big[ \sum_{j=1}^{M} \lambda_j K(x^i, x^j) + b \Big] + q_i \ge 1, \quad i = 1, 2, \ldots, M \tag{14}$$

$$q_i \ge 0, \quad i = 1, 2, \ldots, M. \tag{15}$$


Once the variables $\lambda_j$, $j = 1, 2, \ldots, M$ and $b$ are obtained, the class that a
test point $x$ belongs to can be determined by evaluating the sign of

$$f(x) = w^T \phi(x) + b = \sum_{j=1}^{M} \lambda_j K(x, x^j) + b. \tag{16}$$
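The kernel decision rule of Equation (16) can be sketched in a few lines. The example below assumes an RBF kernel parametrized as $\exp(-\gamma \|a - b\|^2)$ and counts as support vectors those training points whose multiplier $\lambda_j$ is nonzero; the parametrization, the tolerance, the function names and the data are all assumptions made for illustration.

```python
import math


def rbf_kernel(a, b, gamma):
    """Gaussian (RBF) kernel exp(-gamma * ||a - b||^2)."""
    return math.exp(-gamma * sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def kernel_decision(x, train, lam, b, gamma):
    """f(x) from Eq. (16): sum_j lam_j * K(x, x_j) + b."""
    return sum(l_j * rbf_kernel(x, x_j, gamma) for l_j, x_j in zip(lam, train)) + b


def predict(x, train, lam, b, gamma):
    """Class label given by the sign of f(x)."""
    return 1 if kernel_decision(x, train, lam, b, gamma) >= 0 else -1


# Made-up training points and multipliers; the middle multiplier is zero,
# so that point does not contribute to the decision function.
train = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]
lam = [-1.0, 0.0, 1.0]
n_support = sum(1 for l_j in lam if abs(l_j) > 1e-8)
print(predict((0.9, 1.1), train, lam, 0.0, 1.0), n_support)
```

Because only the nonzero multipliers enter the sum, the cost of classifying a test point scales with the number of support vectors, which is why sparse solutions are attractive.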


3. The MCM neurodynamical system 

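The MCM implementation follows the approach of Nguyen [28], which solves
a simple system of differential equations involving both primal and dual
variables. Consider a linear programming problem in the standard form given
by Equations (17)-(19).

$$\max_{\theta} \; q^T \theta \tag{17}$$

$$\text{s.t.} \quad G\theta \le p \tag{18}$$

$$\text{and} \quad \theta \ge 0 \tag{19}$$

The dual is given by Equations (20)-(22).

$$\min_{\delta} \; p^T \delta \tag{20}$$

$$\text{s.t.} \quad G^T \delta \ge q \tag{21}$$

$$\text{and} \quad \delta \ge 0 \tag{22}$$

where $\theta, q \in \mathbb{R}^{n}$, $G \in \mathbb{R}^{m \times n}$ and $\delta, p \in \mathbb{R}^{m}$.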


The primal (resp. dual) network variables are denoted $\theta$ (resp. $\delta$) and evolve
in time as described by the pair of coupled linear ODEs in Equations (23)-(24).

$$\frac{d\theta}{dt} = p - G^T\left(\delta + k\frac{d\delta}{dt}\right) \tag{23}$$

$$\frac{d\delta}{dt} = -q - G\left(\theta + k\frac{d\theta}{dt}\right) \tag{24}$$

where $k$ is a positive constant.

Supposing for the moment that the neural network defined by Equations
(23)-(24) converges to an equilibrium, it can be shown that the optimal solution
for the primal and dual formulations in Equations (17)-(22) is an equilibrium
of (23)-(24), as follows.


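Let the $i$-th element of $\theta$ be denoted as $\theta_i$. Equation (23) can then be written
as

$$\frac{d\theta_i}{dt} = \begin{cases} \left(p - G^T\left(\delta + k\frac{d\delta}{dt}\right)\right)_i, & \text{if } \theta_i > 0 \\ \max\left\{\left(p - G^T\left(\delta + k\frac{d\delta}{dt}\right)\right)_i, 0\right\}, & \text{if } \theta_i = 0 \end{cases}$$

for all $i$. If the equilibrium solution is represented as $\theta^*$ and $\delta^*$, then
$\frac{d\theta}{dt} = 0$ and $\frac{d\delta}{dt} = 0$. Thus, for all $i$,

$$(p - G^T\delta^*)_i \begin{cases} = 0, & \text{if } \theta^*_i > 0 \\ \le 0, & \text{if } \theta^*_i = 0 \end{cases} \tag{25}$$

and,

$$p - G^T\delta^* \le 0 \tag{26}$$

Further, for all $i$,

$$(G\theta^* - q)_i \begin{cases} = 0, & \text{if } \delta^*_i > 0 \\ \le 0, & \text{if } \delta^*_i = 0 \end{cases} \tag{27}$$

and,

$$G\theta^* - q \le 0 \tag{28}$$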


Hence $\theta^*$ and $\delta^*$ are feasible solutions for the system defined by Equations
(23)-(24). Also, we have

$$p^T\theta^* - \theta^{*T} G^T \delta^* = 0 \tag{29}$$

and,

$$\theta^{*T} G^T \delta^* - q^T\delta^* = 0 \tag{30}$$

which implies

$$p^T\theta^* = q^T\delta^* \tag{31}$$

Hence $\theta^*$ and $\delta^*$ are optimal solutions for the system defined by Equations
(23)-(24).


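Also, differentiating Equations (23)-(24) we can write

$$\ddot{\theta} = -G^T(\dot{\delta} + k\ddot{\delta}) \tag{32}$$

$$\ddot{\delta} = -G(\dot{\theta} + k\ddot{\theta}) \tag{33}$$

It remains to be proved that convergence to the equilibrium occurs.
Eliminating $\delta$ (resp. $\theta$) from Equations (32)-(33) yields a second order differential
equation in $\theta$ (resp. $\delta$), namely:

$$(k^2 G^T G - I)\ddot{\theta} + 2kG^TG\dot{\theta} + G^TG\theta = -G^Tq \tag{34}$$

$$(k^2 GG^T - I)\ddot{\delta} + 2kGG^T\dot{\delta} + GG^T\delta = -Gp \tag{35}$$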


















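$$X = \begin{pmatrix} w_{(1 \times n)} & b_{(1 \times 1)} & q_{(1 \times M)} & h_{(1 \times 1)} \end{pmatrix} \tag{36}$$

$$q = \begin{pmatrix} [0]_{(1 \times n)} & [0]_{(1 \times 1)} & C \times [1]_{(1 \times M)} & 1_{(1 \times 1)} \end{pmatrix} \tag{37}$$

$$G = \begin{pmatrix} [Y \cdot \psi]_{(M \times n)} & y_{(M \times 1)} & [0]_{(M \times M)} & -[1]_{(M \times 1)} \\ -[Y \cdot \psi]_{(M \times n)} & -y_{(M \times 1)} & -[I]_{(M \times M)} & [0]_{(M \times 1)} \end{pmatrix} \tag{38}$$

$$p = \begin{pmatrix} [0]_{(M \times 1)} & -[1]_{(M \times 1)} \end{pmatrix}^T \tag{39}$$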


The asymptotic stability of these second order linear ODEs is determined
by the properties of the coefficient matrices. For example, using [?, Thm. 1], it
follows that if $k$ is chosen large enough to make the matrix $k^2G^TG - I$ positive
definite and if $G^TG$ is positive definite, then Equation (34) is asymptotically
stable, implying that, from all initial conditions, its trajectories converge to
the equilibrium point $\theta^*$ (see Equation (25) ff.). One may note here that the
assumption of one of the matrices $G^TG$ or $GG^T$ being positive definite is a mild
one, since it corresponds to assuming that there are no redundant inequality
constraints. Finally, if $\theta$ converges, so must $\delta$, since we are assuming that both
the primal and dual problems are feasible.
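The condition on $k$ can be checked directly in the scalar case. With $G = g$, the homogeneous part of Equation (34) has the characteristic polynomial $(k^2 g^2 - 1)s^2 + 2kg^2 s + g^2$, and the equation is asymptotically stable exactly when both roots have negative real part. The sketch below computes these roots; the function names are hypothetical and the scalar reduction is an illustrative simplification, not the general matrix argument.

```python
import cmath


def characteristic_roots(k, g):
    """Roots of (k^2 g^2 - 1) s^2 + 2 k g^2 s + g^2 = 0 (scalar Eq. (34))."""
    a = k * k * g * g - 1.0
    b = 2.0 * k * g * g
    c = g * g
    if a == 0.0:
        raise ValueError("k * |g| = 1 makes the equation first order")
    disc = cmath.sqrt(b * b - 4.0 * a * c)
    return (-b + disc) / (2.0 * a), (-b - disc) / (2.0 * a)


def is_asymptotically_stable(k, g):
    """True when both characteristic roots have negative real part."""
    return all(root.real < 0.0 for root in characteristic_roots(k, g))


# With g = 1: k = 2 makes k^2 g^2 - 1 positive, k = 0.5 makes it negative.
print(is_asymptotically_stable(2.0, 1.0), is_asymptotically_stable(0.5, 1.0))
```

In this scalar setting the roots work out to $-|g|/(k|g| + 1)$ and $-|g|/(k|g| - 1)$, so stability holds precisely when $k|g| > 1$, in line with the requirement that $k^2 G^T G - I$ be positive definite.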

Hence, for the MCM, the system of equations that finds a minimum VC
dimension classifier aims at finding the equilibrium solution for the set of
variables represented by the augmented vector $X = [w, b, q, h]$. As mentioned
initially, we consider data points $x^i$, $i = 1, 2, \ldots, M$, associated with labels
$y_i \in \{+1, -1\}$, $i = 1, 2, \ldots, M$, each data point being $n$-dimensional.
Let the set of data points be denoted by $\psi_{M \times n}$, of which each row corresponds
to $x^i$, and let the labels be collected in a diagonal matrix $Y$ with diagonal
entries $y_i$, i.e. $Y = \mathrm{diag}(y_1, y_2, y_3, \ldots, y_M)$.

The system finds a solution for each of the $(M + n + 2)$ variables, as shown
in Equation (36). Further, the LPP, as shown by Equations (17)-(19), will now
be defined by $q$, $G$ and $p$ as denoted by Equations (37)-(39), where the notation
$[Y \cdot \psi]$ represents the multiplication of matrices $Y$ and $\psi$, and $I$ represents the
identity matrix.
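The block structure of $q$, $G$ and $p$ can be assembled mechanically from the data. The sketch below mirrors the block layout printed in Equations (36)-(39), including the zero block in the first block row of $G$, using plain Python lists; the function name `build_mcm_lp`, the variable names and the toy data are assumptions made for illustration.

```python
def build_mcm_lp(psi, y, C):
    """Assemble q, G, p for the linear MCM following the printed block layout.

    psi : list of M rows, each a list of n features
    y   : list of M labels in {+1, -1}
    Variable order is (w, b, q, h), so each row of G has n + M + 2 entries.
    """
    M, n = len(psi), len(psi[0])
    q_vec = [0.0] * n + [0.0] + [float(C)] * M + [1.0]
    top, bottom = [], []
    for i in range(M):
        y_x = [y[i] * value for value in psi[i]]            # row i of Y * psi
        identity_row = [1.0 if j == i else 0.0 for j in range(M)]
        top.append(y_x + [y[i]] + [0.0] * M + [-1.0])
        bottom.append([-value for value in y_x] + [-y[i]]
                      + [-e for e in identity_row] + [0.0])
    G = top + bottom
    p_vec = [0.0] * M + [-1.0] * M
    return q_vec, G, p_vec


# Toy data: M = 3 samples with n = 2 features each.
psi = [[1.0, 2.0], [3.0, 4.0], [-1.0, 0.5]]
y = [1.0, -1.0, 1.0]
q_vec, G, p_vec = build_mcm_lp(psi, y, C=2.0)
print(len(q_vec), len(G), len(G[0]), len(p_vec))
```

For $M$ samples of dimension $n$ this yields $2M$ inequality rows over $M + n + 2$ variables, matching the variable count stated in the text.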

The system to be solved can now be represented as shown by Equations
(40)-(41), where $k > 0$ is a free parameter that can be tuned, and $Z$ is the dual
variable of $X$.

$$\frac{dX}{dt} = p - G^T\left(Z + k\frac{dZ}{dt}\right) \tag{40}$$

$$\frac{dZ}{dt} = -q - G\left(X + k\frac{dX}{dt}\right) \tag{41}$$


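For the kernel MCM, the formulation can be obtained similarly by using
$\phi(x^i)$, where $\phi(\cdot)$ is a mapping to the chosen kernel space. The matrices $X$, $q$,
$G$ and $p$ are then represented as shown in Equations (42)-(45).

$$X = \begin{pmatrix} w_{(1 \times M)} & b_{(1 \times 1)} & q_{(1 \times M)} & h_{(1 \times 1)} \end{pmatrix} \tag{42}$$

$$q = \begin{pmatrix} [0]_{(1 \times M)} & [0]_{(1 \times 1)} & C \times [1]_{(1 \times M)} & 1_{(1 \times 1)} \end{pmatrix} \tag{43}$$

$$G = \begin{pmatrix} [Y[\phi(x^i)^T\phi(x^j)]]_{(M \times M)} & y_{(M \times 1)} & [0]_{(M \times M)} & -[1]_{(M \times 1)} \\ -[Y[\phi(x^i)^T\phi(x^j)]]_{(M \times M)} & -y_{(M \times 1)} & -[I]_{(M \times M)} & [0]_{(M \times 1)} \end{pmatrix} \tag{44}$$

$$p = \begin{pmatrix} [0]_{(M \times 1)} & -[1]_{(M \times 1)} \end{pmatrix}^T \tag{45}$$

Also, in terms of the kernel matrix $K(x^i, x^j) = [\phi(x^i)^T\phi(x^j)]$,
the matrix $G$ can also be written as shown in Equation (46).

$$G = \begin{pmatrix} [Y \cdot K(x^i, x^j)]_{(M \times M)} & y_{(M \times 1)} & [0]_{(M \times M)} & -[1]_{(M \times 1)} \\ -[Y \cdot K(x^i, x^j)]_{(M \times M)} & -y_{(M \times 1)} & -[I]_{(M \times M)} & [0]_{(M \times 1)} \end{pmatrix} \tag{46}$$

These can be substituted in Equations (40)-(41) and the system can be
solved to obtain the equilibrium solution for the kernel case.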


4. Simulations of the MCM Neurodynamical System 

In order to visualize the convergence of the system of differential equations,
we provide plots showing the evolution over time of the decision variables of
our system, namely the $w_i$'s, $b$ and $h$. We consider two datasets
(both two dimensional): a linearly separable dataset, shown in Fig. 1a, and a
dataset with points (belonging to the two classes) randomly drawn from a
normal distribution, as shown in Fig. 1b.

The plots for the decision variables for the linearly separable dataset are
shown in Fig. 2. The horizontal axis indicates the time in milliseconds. Figs.
2a and 2b show the plots of $w_1$, $w_2$ and their derivatives $\dot{w}_1$, $\dot{w}_2$ respectively.
Plots of convergence of $b$ and its derivative $\dot{b}$ are shown in Figs. 2c and 2d,
whereas those for $h$ and its derivative $\dot{h}$ are shown in Figs. 2e and 2f.


The plots for the decision variables for the dataset with points drawn from
a normal distribution are shown in Fig. 3. The horizontal axis represents time
in milliseconds. Figs. 3a and 3b show the plots of $w_1$, $w_2$ and their derivatives
$\dot{w}_1$, $\dot{w}_2$ respectively. Plots of convergence of $b$ and its derivative $\dot{b}$ are shown in
Figs. 3c and 3d, whereas those for $h$ and its derivative $\dot{h}$ are shown in Figs. 3e
and 3f.
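To give a feel for how such trajectories can be generated with a simple ODE integrator, the sketch below integrates a one-variable primal-dual network by forward Euler. It is written in Python for illustration and makes several assumptions that should be read as such: the sign convention $\dot{\theta} = q - g(\delta + k\dot{\delta})$, $\dot{\delta} = g(\theta + k\dot{\theta}) - p$ is chosen here so that the rest point satisfies the optimality conditions of the LP pair (17)-(22), and it is not claimed to be the exact network of [28]; non-negativity is enforced by clipping after each step; and the function name, step size and problem data are hypothetical.

```python
def simulate_scalar_network(q, p, g, k, theta0, delta0, dt=0.01, steps=4000):
    """Forward-Euler integration of an assumed primal-dual network for
    max q*theta s.t. g*theta <= p, theta >= 0 (one variable, one constraint).

    Assumed dynamics (sign convention chosen for this sketch):
        theta' = q - g * (delta + k * delta')
        delta' = g * (theta + k * theta') - p
    The two equations are implicit in the derivatives; in the scalar case
    they can be solved in closed form, which is what the loop does.
    """
    theta, delta = theta0, delta0
    denom = 1.0 + (k * g) ** 2
    for _ in range(steps):
        r_primal = q - g * delta          # residual of the dual constraint
        r_dual = g * theta - p            # residual of the primal constraint
        d_theta = (r_primal - k * g * r_dual) / denom
        d_delta = (r_dual + k * g * r_primal) / denom
        # Clip at zero to respect theta >= 0 and delta >= 0.
        theta = max(0.0, theta + dt * d_theta)
        delta = max(0.0, delta + dt * d_delta)
    return theta, delta


# Toy LP: max theta s.t. theta <= 1, theta >= 0 (dual: min delta s.t. delta >= 1).
theta, delta = simulate_scalar_network(q=1.0, p=1.0, g=1.0, k=1.0,
                                       theta0=0.5, delta0=0.5)
print(round(theta, 6), round(delta, 6))
```

In this toy problem the derivative feedback weighted by $k$ acts as damping: the trajectory spirals into the primal-dual optimum instead of oscillating around it.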


5. Results 

The MCM neurodynamical system was implemented in MATLAB R2013a, and
the code was executed on a laptop running a 64-bit Windows operating system with
an Intel i3 processor at 2.53 GHz and 4 GB of RAM.

Figure 1: (a) Example of a linearly separable dataset, and (b) a dataset with
points drawn from a normal distribution. Markers distinguish Class 1 and Class -1.


Table 1 shows the performance of the linear MCM dynamical system on a set
of benchmark datasets from the UCI machine learning repository. The table also
provides a comparison with the standard SVM formulation in the linear case. For
results in the case of the linear MCM, see [19]. Accuracies are shown in mean ±
standard deviation format, computed using a standard five fold cross validation
methodology. The MCM dynamical system attains a higher mean test set accuracy
than the standard SVM on 8 of the 11 datasets, ties on one (Fertility), and is
marginally lower on two (Spect and Haberman).

Table 1: Test Set Accuracies for the Linear MCM Dynamical System

| S. No. | Dataset | Size (samples × features) | Linear MCM Dynamical System (accuracy, %) | Linear SVM (accuracy, %) |
|---|---|---|---|---|
| 1 | Hayes Roth | 132 × 5 | 76.11 ± 8.72 | 73.56 ± 7.73 |
| 2 | Hepatitis | 165 × 19 | 69.35 ± 8.71 | 60.64 ± 7.19 |
| 3 | TA Evaluation | 151 × 5 | 69.52 ± 6.92 | 64.94 ± 6.56 |
| 4 | Promoters | 106 × 58 | 68.92 ± 6.91 | 67.78 ± 10.97 |
| 5 | Voting | 435 × 16 | 95.97 ± 3.75 | 94.48 ± 2.46 |
| 6 | Australian | 690 × 14 | 85.79 ± 2.59 | 84.49 ± 1.18 |
| 7 | Bands | 512 × 39 | 72.58 ± 3.98 | 71.69 ± 3.81 |
| 8 | Fertility | 100 × 10 | 86.00 ± 6.91 | 86.00 ± 9.01 |
| 9 | Spect | 267 × 22 | 91.46 ± 4.28 | 91.99 ± 4.90 |
| 10 | Haberman | 306 × 3 | 72.01 ± 3.54 | 72.56 ± 3.73 |
| 11 | Planning-Relax | 182 × 13 | 72.41 ± 7.81 | 71.42 ± 7.37 |
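The mean ± standard deviation entries summarize per-fold accuracies from five fold cross validation. A minimal sketch of that bookkeeping is given below. The paper does not state how folds were formed or which standard deviation estimator was used, so the contiguous split and the sample standard deviation are assumptions, and the per-fold accuracies are placeholder numbers rather than results from the experiments.

```python
import statistics


def five_fold_indices(n_samples, n_folds=5):
    """Split range(n_samples) into contiguous folds whose sizes differ by at most one."""
    base, extra = divmod(n_samples, n_folds)
    folds, start = [], 0
    for f in range(n_folds):
        size = base + (1 if f < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def mean_pm_std(values):
    """Format values as 'mean ± sample standard deviation'."""
    return f"{statistics.mean(values):.2f} ± {statistics.stdev(values):.2f}"


folds = five_fold_indices(132)                      # 132 = number of Hayes Roth samples
placeholder_acc = [80.0, 90.0, 100.0, 90.0, 90.0]   # made-up per-fold accuracies
print([len(f) for f in folds], mean_pm_std(placeholder_acc))
```

Each fold serves once as the test set while the remaining four are used for training, and the five resulting accuracies are summarized in the format used in the tables.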


Table 2 shows the performance of the kernel MCM dynamical system on
a set of benchmark datasets from the UCI machine learning repository. The
table also provides a comparison with the SVM using the RBF kernel. The
hyperparameter $C$ was tuned by grid search; a similar search was used to
determine the width of the RBF kernel. As before, accuracies
are shown in mean ± standard deviation format, computed using a standard
five fold cross validation methodology. One can see that the MCM dynamical
system yields comparable or better performance than the SVM. Further, it is
observed that the kernel MCM always uses fewer support vectors; indeed, up
to 74.3% fewer support vectors (computed on the average number of support
vectors). The numbers of support vectors in Table 2 are also reported in
mean ± standard deviation format across the folds
on which the accuracies have been computed, and hence the values are shown
as floating point numbers. Observe that in rows 1, 2, 4, 6, 7, 8, 9 and 10 of Table 2
the proposed kernel MCM achieves a higher test set accuracy with a smaller
number of support vectors than the standard kernel SVM. Since the number of
support vectors has a significant bearing on the number of computations, the
MCM can be seen to be parsimonious in terms of computational requirements.
This also translates into lower power consumption in hardware and VLSI
realizations [?].

Figure 2: Plots of convergence of the decision variables $w$, $b$ and $h$, and their
first derivatives $\dot{w}$, $\dot{b}$ and $\dot{h}$ with time, for the linearly separable dataset shown
in Fig. 1a. Panels: (a) $w_1$ and $w_2$; (b) $\dot{w}_1$ and $\dot{w}_2$; (c) $b$; (d) $\dot{b}$; (e) $h$; (f) $\dot{h}$.

Figure 3: Plots of convergence of the decision variables $w$, $b$ and $h$, and their
first derivatives $\dot{w}$, $\dot{b}$ and $\dot{h}$ with time, for the dataset with points drawn from
a normal distribution, as shown in Fig. 1b. Panels (a)-(f) as in Figure 2.


Table 2: Test Set Accuracies and number of Support Vectors (#SVs) for the
Kernel MCM Dynamical System (KMCM-DS) compared with standard RBF
Kernel SVM (KSVM)

| S. No. | Dataset | Size (samples × features) | KMCM-DS Test Set Acc. (%) | KMCM-DS #SVs | KSVM Test Set Acc. (%) | KSVM #SVs |
|---|---|---|---|---|---|---|
| 1 | Spect | 267 × 22 | 91.99 ± 4.90 | 49.6 ± 0.54 | 84.21 ± 4.90 | 50.2 ± 9.88 |
| 2 | TA Evaluation | 151 × 5 | 80.86 ± 6.87 | 26.60 ± 32.43 | 68.88 ± 6.48 | 86.00 ± 3.22 |
| 3 | Fertility Diagnosis | 100 × 10 | 88.00 ± 1.03 | 9.80 ± 19.60 | 88.00 ± 9.27 | 38.20 ± 1.60 |
| 4 | Hayes Roth | 132 × 5 | 81.45 ± 7.98 | 33.23 ± 1.11 | 79.57 ± 6.60 | 84.20 ± 2.04 |
| 5 | Hepatitis | 165 × 19 | 79.35 ± 4.09 | 20.00 ± 0.00 | 82.57 ± 6.32 | 72.20 ± 4.31 |
| 6 | Promoters | 106 × 58 | 69.87 ± 7.85 | 84.8 ± 0.44 | 66.45 ± 6.52 | 94.0 ± 0.70 |
| 7 | Bands | 512 × 39 | 77.88 ± 4.14 | 341.2 ± 0.44 | 75.69 ± 3.81 | 427.6 ± 3.78 |
| 8 | Planning-Relax | 182 × 13 | 78.57 ± 8.23 | 116.8 ± 0.54 | 71.42 ± 8.43 | 145.6 ± 6.45 |
| 9 | Haberman | 306 × 3 | 76.45 ± 4.37 | 71.0 ± 0.414 | 72.89 ± 4.58 | 137.4 ± 3.36 |
| 10 | Australian | 690 × 14 | 76.95 ± 2.63 | 152 ± 4.86 | 66.23 ± 1.84 | 244.8 ± 4.604 |
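The 74.3% figure quoted for the reduction in support vectors can be reproduced from the mean support-vector counts in Table 2. The short calculation below assumes the reduction is defined as one minus the ratio of the KMCM-DS mean to the KSVM mean; the variable names are illustrative.

```python
# Mean support-vector counts copied from Table 2: (KMCM-DS, KSVM).
mean_svs = {
    "Spect": (49.6, 50.2),
    "TA Evaluation": (26.60, 86.00),
    "Fertility Diagnosis": (9.80, 38.20),
    "Hayes Roth": (33.23, 84.20),
    "Hepatitis": (20.00, 72.20),
    "Promoters": (84.8, 94.0),
    "Bands": (341.2, 427.6),
    "Planning-Relax": (116.8, 145.6),
    "Haberman": (71.0, 137.4),
    "Australian": (152.0, 244.8),
}

# Percentage reduction in the mean number of support vectors per dataset.
reduction = {name: 100.0 * (1.0 - mcm / svm) for name, (mcm, svm) in mean_svs.items()}
best = max(reduction, key=reduction.get)
print(best, round(reduction[best], 1))
```

Under this definition the largest reduction occurs on the Fertility Diagnosis dataset, and every dataset shows a positive reduction, consistent with the statement that the kernel MCM always uses fewer support vectors.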


6. Conclusion 

In this paper, we describe a neurodynamical system that converges to a
classifier with minimum VC dimension. A learning machine with such properties
is attractive for building circuits that can exploit the advantages of speed and
parallelism that neurodynamical systems offer. It is also of interest as part of
larger learning networks and adaptive control systems. Further work in this
direction involves developing neurodynamical systems using MCMs for regression
and other classification scenarios such as multilabel and multiclass problems.